Hi Shinya,
I uploaded the first part of count matrices of ROSMAP single-nucleus RNAseq to Google Drive. I have sent invitation to the Google folder separately.
Because my storage space is limited, I will transfer data twice. The first part contains 240 folders and 1 annotation file. Please let me know when you finish downloading them. I will delete them and upload the second part.
Each folder is named for ROSMAP projid and contains a count matrix of nuclei from the individual.
matrix.mtx.gz: UMI count matrix in the MatrixMarket format.
barcodes.tsv.gz: droplet barcodes, batch, WGS ID, and projid of nuclei.
features.tsv.gz: Gene ID and gene symbols.
You can read them using the Seurat package of R.
> library(Seurat)
> dat <- ReadMtx(mtx="matrix.mtx.gz", cells="barcodes.tsv.gz", features="features.tsv.gz")
> seu <- CreateSeuratObject(dat)
annotation.2022-01-24.tsv.gz is the latest version of our cell type annotation. This one will be mostly stable, but our collaborator may still add minor tweaks in a next few months. There are several levels of cell type annotations, but we are using cell.type as the top-level cell type, and state as their subtype classifications.
Brief description of data processing:
Nuclei were isolated from 479 DLPFC tissues of ROSMAP. Tissues were processed as 60 batches, and each batch consisted from 8 donors. In each batch, nuclei suspension of 8 donors were mixed together, and single-nucleus RNAseq library was prepared using 10x Genomics 3 Gene Expression kit (v3 chemistry). The libraries were sequenced, and read mapped and UMI counting were performed using CellRanger v6.0.0 with GENCODE v32 and GRCh38.p13. Original donors of droplets in each batch were inferred by comparing SNPs in RNA reads with ROSMAP WGS VCF using genetic demultiplexing software demuxlet. As a quality control, genotype concordance of RNA and WGS, sex check, duplicated donors, WGS QC, and sequencing depth were assessed, and 424 donors passed the QC. To annotate cell types based on single nucleus expression, nuclei were classified into 8 major cell types, and each major cell type were analyzed separately. Doublets were removed using DoubletFinder, and cells were clustered using Seurat.
Best,
Masashi
data_dir = "/pastel/resources/20220203_snRNAseq_AMPAD/"# Read phenotype
pheno = readRDS("/pastel/resources/phenotypes/basic_Apr2019.rds")
# Read annotation
annotation = read_tsv(paste0(data_dir,"annotation.2022-01-24.tsv.gz"), show_col_types = FALSE)
# length(unique(annotation$projid)) # 438
# List samples included
files_list = list.dirs(path = data_dir, full.names = F)
files_list = files_list[!files_list==""]
metadata_df = data.frame(projid = files_list)
# length(unique(metadata_df$projid)) # 424
pheno.filt = pheno$data %>% filter(projid %in% unique(metadata_df$projid))
annotation.filt = annotation %>% filter(projid %in% unique(metadata_df$projid))
cell_types_df = as.data.frame(table(annotation$cell.type)) %>% arrange(-Freq) %>% mutate(Fraction = scales::percent(Freq/sum(Freq))) %>% rename(CellType = Var1)
cell_types_dfCellType <fct> | Freq <int> | Fraction <chr> | ||
|---|---|---|---|---|
| Excitatory Neurons | 671335 | 40.37335% | ||
| Oligodendrocytes | 333147 | 20.03510% | ||
| Inhibitory Neurons | 257929 | 15.51157% | ||
| Astrocyte | 228925 | 13.76730% | ||
| Microglia | 83729 | 5.03537% | ||
| OPCs | 65635 | 3.94722% | ||
| Endothelial | 10484 | 0.63050% | ||
| Fibroblast | 3705 | 0.22281% | ||
| Pericytes | 3132 | 0.18836% | ||
| Macrophages | 2131 | 0.12816% |
We will read each sample’ counts table, and create pseudo-bulk tables for each cell type (at least >1%).
cell_types_df_to_consider = cell_types_df %>% filter(Freq/sum(Freq) > 0.01)
cell_type_ids = list()
for(cell in cell_types_df_to_consider$CellType){
cell_type_ids[[cell]] <- annotation[annotation$cell.type == cell, ]$barcode
}
read_counts <- function(sample){
#sample = metadata_df$projid[1]
mtx_file = paste0(data_dir,sample,"/matrix.mtx.gz")
cells_file = paste0(data_dir,sample,"/barcodes.tsv.gz")
features_file = paste0(data_dir,sample,"/features.tsv.gz")
dat <- ReadMtx(mtx=mtx_file, cells=cells_file, features=features_file, feature.column = 1)
pseudo_counts <- list()
for(cell in names(cell_type_ids)){
#cell = names(cell_type_ids)[1]
barcodes = annotation[annotation$cell.type == cell,]$barcode
cell_selected = which(colnames(dat)%in%barcodes)
pseudo_counts[[cell]] <- rowSums(dat[,cell_selected])
}
pseudo_counts$sample = sample
return(pseudo_counts)
}
pseudo_counts_list <- furrr::future_map(metadata_df$projid, read_counts, .options = furrr_options(seed = 123))
gene_names = names(pseudo_counts_list[[1]]$`Excitatory Neurons`)
sample_names = unique(metadata_df$projid)
ext_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
inh_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
oli_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
ast_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
mic_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
opc_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
for(sample_counts in pseudo_counts_list){
#sample_counts = pseudo_counts_list[[1]]
ext_pseudo_counts[names(sample_counts$`Excitatory Neurons`),sample_counts$sample] <- sample_counts$`Excitatory Neurons`
inh_pseudo_counts[names(sample_counts$`Inhibitory Neurons`),sample_counts$sample] <- sample_counts$`Inhibitory Neurons`
oli_pseudo_counts[names(sample_counts$Oligodendrocytes),sample_counts$sample] <- sample_counts$Oligodendrocytes
ast_pseudo_counts[names(sample_counts$Astrocyte),sample_counts$sample] <- sample_counts$Astrocyte
mic_pseudo_counts[names(sample_counts$Microglia),sample_counts$sample] <- sample_counts$Microglia
opc_pseudo_counts[names(sample_counts$OPCs),sample_counts$sample] <- sample_counts$OPCs
}
# Just checking if total counts are the same
for(sample_counts in pseudo_counts_list){
if (!(sum(ast_pseudo_counts[,sample_counts$sample]) == sum(sample_counts$Astrocyte))){
print(sample_counts$sample)
}
}
save(ext_pseudo_counts,inh_pseudo_counts,oli_pseudo_counts,ast_pseudo_counts,mic_pseudo_counts,opc_pseudo_counts,
file = paste0(data_dir,"pseudo_counts.RData"))library(tidyverse)
library(rtracklayer)
library(GenomicFeatures)
library(BSgenome.Hsapiens.UCSC.hg38)
library(DESeq2)
# process gene and transcript annotation #####
# Using: GENCODE v32 and GRCh38.p13
file = "/pastel/resources/annotation/gencode.v32.primary_assembly.annotation.gtf.gz"
gtf = import(file)
gtf = as.data.frame(gtf)
gtf %>% filter(type%in%c("transcript")) -> gtf.transcripts
gtf %>% filter(type%in%c("gene")) -> gtf.genes
txdb <- makeTxDbFromGFF(file)
ebg <- exonsBy(txdb, by="gene")
exons <- reduce(ebg)
exons <- exons[ gtf.genes$gene_id ]
seqnames(Hsapiens) <- gsub("(.*?)_(.*)","\\2",gsub("v","\\.",gsub("(.*?)_(.*?)_(.*)","\\2",seqnames(Hsapiens))))
dna <- extractTranscriptSeqs(Hsapiens, exons)
gtf.genes$gc_exon <- as.numeric(letterFrequency(dna, "GC", as.prob=TRUE))
gtf.genes$len_exon <- sum(width(ranges(exons)))
# save(gtf.genes, file = "/pastel/resources/annotation/gencode.v32.primary_assembly.ann.RData")From Bryois et al medRxiv
Pseudo-bulk gene expression matrices were then generated by summing all counts for each gene in each patient in each cell type and normalized by scaling the total counts per patient for each cell type to 1 million.
Here:
library(edgeR)
library(cqn)
load("/pastel/resources/annotation/gencode.v32.primary_assembly.ann.RData")
load(paste0(data_dir,"pseudo_counts.RData"))
gtf.genes = gtf.genes %>% mutate(ensembl_id_nover = gsub("(.*)\\.(.*)","\\1",gene_id))
## Batch info
sample_batch = annotation %>% filter(projid %in% metadata_df$projid) %>% dplyr::select(projid, batch) %>% distinct() %>%
mutate(batch_0 = gsub("(.*)-(.*)","\\1",batch), batch_1 = gsub("(.*?)-(.*)","\\1",batch), batch_2 = gsub("(.*?)-(.*?)-(.*)","\\2",batch)) %>% distinct()
pseudo_counts = list("ext" = ext_pseudo_counts,
"inh" = inh_pseudo_counts,
"oli" = oli_pseudo_counts,
"ast" = ast_pseudo_counts,
"mic" = mic_pseudo_counts,
"opc" = opc_pseudo_counts)
celltype_exp = list()
for (cell_type in names(pseudo_counts)){
# cell_type = names(pseudo_counts)[5] # debug
cell_data = list()
counts = pseudo_counts[[cell_type]]
# Filter genes by CPM
counts_cpm = as.data.frame(cpm(counts))
keep.exp <- rowSums(counts_cpm > 1) >= (0.8 * ncol(counts_cpm) )
counts_cpm_filt = counts_cpm[keep.exp,]
counts_filt = counts[keep.exp,]
gtf.genes_filt = gtf.genes[match(rownames(counts_filt), gtf.genes$ensembl_id_nover),]
# all(rownames(counts_filt)==gtf.genes_filt$ensembl_id_nover) # TRUE?
# TPM
tpm_x = counts_filt/gtf.genes_filt$len_exon
tpm = t(t(tpm_x)*1e6/colSums(tpm_x))
sizeFactors = colSums(counts_filt, na.rm = T)
cqn.subset <- cqn(counts_filt, lengths = gtf.genes_filt$len_exon, x = gtf.genes_filt$gc_exon, sizeFactors = sizeFactors, verbose = TRUE)
RPKM.cqn.log2 <- cqn.subset$y + cqn.subset$offset
# convert to count
adjusted.count <- 2^RPKM.cqn.log2 * rep(colSums(counts_filt), nrow(RPKM.cqn.log2)) %>%
matrix(., ncol=ncol(counts_filt), nrow=nrow(counts_filt), byrow = T)
adjusted.count = adjusted.count/10^6 - 1
adjusted.count[adjusted.count<0]=0
# quantile-voom
counts_voom = voom(counts = adjusted.count, normalize.method="quantile")
quantile_voom = as.data.frame(counts_voom$E)
# TMM-voom
norm <- calcNormFactors(counts_filt, method = "TMM")
dge = DGEList(counts_filt, norm.factors = norm)
counts_voom <- voom(dge)
tmm_voom <- as.data.frame(counts_voom$E)
cell_data$counts <- counts
cell_data$counts_cpm <- counts_cpm_filt
cell_data$counts_tpm <- tpm
cell_data$cqn <- cqn.subset
cell_data$RPKM.cqn.log2 <- RPKM.cqn.log2
cell_data$adjusted.count <- adjusted.count
cell_data$quantile_voom <- quantile_voom
cell_data$tmm_voom <- tmm_voom
celltype_exp[[cell_type]] <- cell_data
}
save(celltype_exp, pheno.filt, annotation.filt, file = paste0(data_dir,"celltype_exp.RData"))Summary based on cqn adj. counts
load(paste0(data_dir,"celltype_exp.RData"))
df_summary = data.frame()
for (cell in names(celltype_exp)){
#cell = names(celltype_exp)[1]
dat = celltype_exp[[cell]]
dat2 = as.data.frame(dat$adjusted.count)
dat_df = data.frame(cell_type = cell,
ngenes = nrow(dat2),
nsamples = ncol(dat2),
avg_libsize = mean(colSums(dat2)),
avg_counts = mean(rowMeans(dat2)))
df_summary = rbind(df_summary, dat_df)
}
df_summarycell_type <chr> | ngenes <int> | nsamples <int> | avg_libsize <dbl> | avg_counts <dbl> |
|---|---|---|---|---|
| ext | 16836 | 424 | 40891078.0 | 2428.78819 |
| inh | 16467 | 424 | 13641449.2 | 828.41132 |
| oli | 15859 | 424 | 2234842.6 | 140.91952 |
| ast | 16383 | 424 | 4482531.5 | 273.60871 |
| mic | 14395 | 424 | 621604.4 | 43.18197 |
| opc | 15364 | 424 | 1045919.5 | 68.07599 |
Comparison between distributions
library(ggpubr)
library(viridis)
cell_color_map = list(ext = ggsci::pal_nejm("default")(6)[1],
inh = ggsci::pal_nejm("default")(6)[2],
oli = ggsci::pal_nejm("default")(6)[3],
ast = ggsci::pal_nejm("default")(6)[4],
mic = ggsci::pal_nejm("default")(6)[5],
opc = ggsci::pal_nejm("default")(6)[6])
cell_labels = list(ext = "Excitatory Neurons",
inh = "Inhibitory Neurons",
oli = "Oligodendrocytes",
ast = "Astrocytes",
mic = "Microglia",
opc = "OPCs")
cell_labels_n = list(ext = paste0("Excitatory Neurons (genes = ", df_summary$ngenes[df_summary$cell_type=="ext"], ")"),
inh = paste0("Inhibitory Neurons (genes = ", df_summary$ngenes[df_summary$cell_type=="inh"], ")"),
oli = paste0("Oligodendrocytes (genes = ", df_summary$ngenes[df_summary$cell_type=="oli"], ")"),
ast = paste0("Astrocytes (genes = ", df_summary$ngenes[df_summary$cell_type=="ast"], ")"),
mic = paste0("Microglia (genes = ", df_summary$ngenes[df_summary$cell_type=="mic"], ")"),
opc = paste0("OPCs (genes = ", df_summary$ngenes[df_summary$cell_type=="opc"], ")"))
plot_exp_density_per_sample <- function(exp_matrix, title = "", subtitle = "", x_label = "value", color = "#4DBBD5FF"){
#exp_matrix <- cell_type$RPKM.cqn.log2
exp_matrix <- as.matrix(exp_matrix)
nsamples <- ncol(exp_matrix)
colfunc <- grDevices::colorRampPalette(c(color, alpha(color, 0.5)))
col = alpha(colfunc(nsamples), alpha = 0.1)
per_sample = reshape2::melt(exp_matrix)
density_p <- ggplot(per_sample, aes(x=value, color=as.factor(Var2))) +
geom_density(show.legend = FALSE) +
scale_fill_viridis_d() +
scale_color_manual(values=col) +
labs(x = x_label, y = "Density", title = title, subtitle = subtitle) +
theme_classic()
return(density_p)
}
dist_plots = list()
for(cell_type_i in names(celltype_exp)){
#cell_type_i = names(celltype_exp)[5] # debug
cell_type <- celltype_exp[[cell_type_i]]
a1 = plot_exp_density_per_sample(log2(cell_type$counts_cpm+1), subtitle = cell_labels_n[[cell_type_i]], x_label = expression(log[2]("CPM + 1")), color = cell_color_map[[cell_type_i]])
a2 = plot_exp_density_per_sample(log2(cell_type$counts_tpm+1), subtitle = cell_labels_n[[cell_type_i]], x_label = expression(log[2](TPM + 1)), color = cell_color_map[[cell_type_i]])
a3 = plot_exp_density_per_sample(cell_type$RPKM.cqn.log2, subtitle = cell_labels_n[[cell_type_i]], x_label = expression(log[2]("RPKM CQN")), color = cell_color_map[[cell_type_i]])
a4 = plot_exp_density_per_sample(cell_type$tmm_voom, subtitle = cell_labels_n[[cell_type_i]], x_label = expression("tmm-voom"), color = cell_color_map[[cell_type_i]])
dist_plots[[cell_type_i]] <- ggarrange(a1,a2,a3,a4, nrow = 1, ncol = 4)
}
ggarrange(plotlist = dist_plots, nrow = 6, ncol = 1, labels = "auto")sessionInfo()R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Stream 8
Matrix products: default BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.15.so
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] viridis_0.6.2 viridisLite_0.4.0 ggpubr_0.4.0 furrr_0.2.3
[5] future_1.24.0 kableExtra_1.3.4 forcats_0.5.1 stringr_1.4.0
[9] dplyr_1.0.8 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[13] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1 SeuratObject_4.0.4 [17] Seurat_4.1.0
loaded via a namespace (and not attached): [1] readxl_1.4.0 backports_1.4.1 systemfonts_1.0.4
[4] plyr_1.8.7 igraph_1.3.0 lazyeval_0.2.2
[7] splines_4.1.2 listenv_0.8.0 scattermore_0.8
[10] digest_0.6.29 htmltools_0.5.2 fansi_1.0.3
[13] magrittr_2.0.3 tensor_1.5 cluster_2.1.2
[16] ROCR_1.0-11 tzdb_0.3.0 globals_0.14.0
[19] modelr_0.1.8 matrixStats_0.62.0 vroom_1.5.7
[22] svglite_2.1.0 spatstat.sparse_2.1-0 colorspace_2.0-3
[25] rvest_1.0.2 ggrepel_0.9.1 haven_2.4.3
[28] xfun_0.30 crayon_1.5.1 jsonlite_1.8.0
[31] spatstat.data_2.1-4 survival_3.2-13 zoo_1.8-9
[34] glue_1.6.2 polyclip_1.10-0 gtable_0.3.0
[37] webshot_0.5.2 leiden_0.3.9 car_3.0-12
[40] future.apply_1.8.1 abind_1.4-5 scales_1.2.0
[43] DBI_1.1.2 rstatix_0.7.0 spatstat.random_2.2-0 [46] miniUI_0.1.1.1 Rcpp_1.0.8.3 xtable_1.8-4
[49] reticulate_1.24 spatstat.core_2.4-2 bit_4.0.4
[52] htmlwidgets_1.5.4 httr_1.4.2 RColorBrewer_1.1-3
[55] ellipsis_0.3.2 ica_1.0-2 farver_2.1.0
[58] pkgconfig_2.0.3 sass_0.4.1 uwot_0.1.11
[61] dbplyr_2.1.1 deldir_1.0-6 utf8_1.2.2
[64] labeling_0.4.2 tidyselect_1.1.2 rlang_1.0.2
[67] reshape2_1.4.4 later_1.3.0 munsell_0.5.0
[70] cellranger_1.1.0 tools_4.1.2 cli_3.3.0
[73] generics_0.1.2 broom_0.7.12 ggridges_0.5.3
[76] evaluate_0.15 fastmap_1.1.0 yaml_2.3.5
[79] goftest_1.2-3 bit64_4.0.5 knitr_1.38
[82] fs_1.5.2 fitdistrplus_1.1-8 RANN_2.6.1
[85] pbapply_1.5-0 nlme_3.1-153 mime_0.12
[88] xml2_1.3.3 compiler_4.1.2 rstudioapi_0.13
[91] plotly_4.10.0 png_0.1-7 ggsignif_0.6.3
[94] spatstat.utils_2.3-0 reprex_2.0.1 bslib_0.3.1
[97] stringi_1.7.6 highr_0.9 lattice_0.20-45
[100] Matrix_1.3-4 ggsci_2.9 vctrs_0.4.1
[103] pillar_1.7.0 lifecycle_1.0.1 spatstat.geom_2.4-0
[106] lmtest_0.9-40 jquerylib_0.1.4 RcppAnnoy_0.0.19
[109] data.table_1.14.2 cowplot_1.1.1 irlba_2.3.5
[112] httpuv_1.6.5 patchwork_1.1.1 R6_2.5.1
[115] promises_1.2.0.1 KernSmooth_2.23-20 gridExtra_2.3
[118] parallelly_1.30.0 codetools_0.2-18 MASS_7.3-54
[121] assertthat_0.2.1 withr_2.5.0 sctransform_0.3.3
[124] mgcv_1.8-38 parallel_4.1.2 hms_1.1.1
[127] grid_4.1.2 rpart_4.1-15 rmarkdown_2.13
[130] carData_3.0-5 Rtsne_0.15 shiny_1.7.1
[133] lubridate_1.8.0